Automatic Discovery of Adposition Typology
Abstract
Natural languages can be classified as prepositional or postpositional based on the relative order of the noun phrase and the adposition. Categorizing a language by its adposition typology helps address several challenges in linguistics and natural language processing (NLP). Determining the adposition typology of a less-studied language by manually analyzing large text corpora can be quite expensive, yet automatic discovery of these typologies has received very little attention to date. This research presents a simple unsupervised technique to automatically predict the adposition typology of a language. Most of the function words of a language are adpositions, and we show that function words can be effectively separated from content words by leveraging differences in their distributional properties in a corpus. Using this principle, we show that languages can be classified as prepositional or postpositional based on rank correlations derived from the entropies of word co-occurrence distributions. Our claims are substantiated through experiments on 23 languages from ten diverse families, 19 of which are correctly classified by our technique.
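The abstract does not spell out which co-occurrence distributions or which rank correlation the authors use, so the following is only a minimal sketch under stated assumptions: it measures, for each word, the entropy of its immediate left and right neighbours, and then computes a Spearman rank correlation between word frequency and each entropy. The function names (`context_entropies`, `spearman`, `frequency_entropy_correlations`) and the choice of adjacent-word contexts are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def context_entropies(tokens):
    """Return {word: (left_entropy, right_entropy)} in bits.

    Assumption: "co-occurrence" means the immediately adjacent word.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        right[prev][cur] += 1  # cur appears to the right of prev
        left[cur][prev] += 1   # prev appears to the left of cur

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total)
                    for c in counter.values())

    return {w: (entropy(left[w]), entropy(right[w])) for w in set(tokens)}


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation (Pearson correlation of the ranks)."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for constant input
    return cov / math.sqrt(vx * vy)


def frequency_entropy_correlations(tokens):
    """Return (rho_left, rho_right): frequency vs. left/right entropy."""
    freq = Counter(tokens)
    ent = context_entropies(tokens)
    words = sorted(freq)
    f = [freq[w] for w in words]
    return (spearman(f, [ent[w][0] for w in words]),
            spearman(f, [ent[w][1] for w in words]))
```

On a real corpus one might compare the two correlations returned by `frequency_entropy_correlations`, on the intuition that an adposition's neighbour on the noun-phrase side is more varied than on the other side. How the paper turns such statistics into a prepositional or postpositional decision is not stated in the abstract, so no decision rule is shown here.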
Related resources
The Ising Model for Changes in Word Ordering Rule in Natural Languages
The order of ‘noun and adposition’ is the important parameter of word ordering rules in the world’s languages. The seven parameters, ‘adverb and verb’ and others, have a strong dependence on the ‘noun and adposition’. Japanese as well as Korean, Tamil and several other languages seem to have a stable structure of word ordering rules, as well as Thai and other languages which have the opposite w...
Automatic Discovery of Technology Networks for Industrial-Scale R&D IT Projects via Data Mining
Industrial-Scale R&D IT Projects depend on many sub-technologies which need to be understood and have their risks analysed before the project can begin for their success. When planning such an industrial-scale project, the list of technologies and the associations of these technologies with each other is often complex and form a network. Discovery of this network of technologies is time consumi...
Crosslinguistic Computation and a Rhythm-based Classification of Languages
This paper is in line with the principles of numerical taxonomy and with the program of holistic typology. It integrates the level of phonology with the morphological and syntactical level by correlating metric properties (such as n of phonemes per syllable and n of syllables per clause) with non-metric variables such as the number of morphological cases and adposition order. The study of cross...
Multi-source Cross-lingual Delexicalized Parser Transfer: Prague or Stanford?
We compare two annotation styles, Prague dependencies and Universal Stanford Dependencies, in their adequacy for parsing. We specifically focus on comparing the adposition attachment style, used in these two formalisms, applied in multisource cross-lingual delexicalized dependency parser transfer performed by parse tree combination. We show that in our setting, converting the adposition annotat...
Epipaleolithic Site Discovery in Southeastern of Iran, Rayen
In the spring of 2012 an archaeological survey was conducted within the Rayen region in order to explore all the archaeological periods in this region and document all archaeological settlement patterns of this region. In general, 52 archaeological sites were discovered and documented which included all prehistoric and historic sites. Among these sites, three of them seem to belong to the Paleo...
Publication date: 2014